Abstract

There is plenty of theoretical and empirical evidence that the depth
of neural networks is a crucial ingredient for their success. However,
network training becomes more difficult with increasing depth, and
training of very deep networks remains an open problem. In this
extended abstract, we introduce a new architecture designed to ease
gradient-based training of very deep networks.

We refer to networks with this architecture as highway networks, since
they allow unimpeded information flow across several layers on
information highways. The architecture is characterized by the use of
gating units which learn to regulate the flow of information through a
network. Highway networks with hundreds of layers can be trained
directly using stochastic gradient descent and with a variety of
activation functions, opening up the possibility of studying extremely
deep and efficient architectures.

Note: A full paper extending this study is available at
http://arxiv.org/abs/1507.06228, with additional references,
experiments and analysis.

1. Introduction 

Many recent empirical breakthroughs in supervised machine learning
have been achieved through the application of deep neural networks.
Network depth (referring to the number of successive computation
layers) has played perhaps the most important role in these successes.
For instance, the top-5 image classification accuracy on the
1000-class ImageNet dataset has increased from ~84% (Krizhevsky et
al., 2012) to ~95% (Szegedy et al., 2014; Simonyan & Zisserman, 2014)
through the use of ensembles of deeper architectures and smaller
receptive fields (Ciresan et al., 2011a;b; 2012) in just a few years.

On the theoretical side, it is well known that deep networks can
represent certain function classes exponentially more efficiently than
shallow ones (e.g. the work of Håstad (1987); Håstad & Goldmann (1991)
and recently of Montúfar et al. (2014)). As argued by Bengio et al.
(2013), the use of deep networks can offer both computational and
statistical efficiency for complex tasks.

However, training deeper networks is not as straightforward as simply
adding layers. Optimization of deep networks has proven to be
considerably more difficult, leading to research on initialization
schemes (Glorot & Bengio, 2010; Saxe et al., 2013; He et al., 2015),
techniques of training networks in multiple stages (Simonyan &
Zisserman, 2014; Romero et al., 2014) or with temporary companion loss
functions attached to some of the layers (Szegedy et al., 2014; Lee et
al., 2015).

In this extended abstract, we present a novel architecture that
enables the optimization of networks with virtually arbitrary depth.
This is accomplished through the use of a learned gating mechanism for
regulating information flow, which is inspired by Long Short-Term
Memory recurrent neural networks (Hochreiter & Schmidhuber, 1995). Due
to this gating mechanism, a neural network can have paths along which
information can flow across several layers without attenuation. We
call such paths information highways, and such networks highway
networks.

In preliminary experiments, we found that highway networks as deep as
900 layers can be optimized using simple Stochastic Gradient Descent
(SGD) with momentum. For up to 100 layers we compare their training
behavior to that of traditional networks with normalized
initialization (Glorot & Bengio, 2010; He et al., 2015). We show that
optimization of highway networks is virtually independent of depth,
while for traditional networks it suffers significantly as the number
of layers increases. We also show that architectures comparable to
those recently presented by Romero et al. (2014) can be directly
trained to obtain similar test set accuracy on the CIFAR-10 dataset
without the need for a pre-trained teacher network.

1.1. Notation 

We use boldface letters for vectors and matrices, and italicized
capital letters to denote transformation functions. $\mathbf{0}$ and
$\mathbf{1}$ denote vectors of zeros and ones respectively, and
$\mathbf{I}$ denotes an identity matrix. The function $\sigma(x)$ is
defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \quad x \in \mathbb{R}.$$

2. Highway Networks 

A plain feedforward neural network typically consists of $L$ layers,
where the $l$-th layer ($l \in \{1, 2, \ldots, L\}$) applies a
non-linear transform $H$ (parameterized by $\mathbf{W_{H,l}}$) on its
input $\mathbf{x}_l$ to produce its output $\mathbf{y}_l$. Thus,
$\mathbf{x}_1$ is the input to the network and $\mathbf{y}_L$ is the
network's output. Omitting the layer index and biases for clarity,

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}). \tag{1}$$

H is usually an affine transform followed by a non-linear 
activation function, but in general it may take other forms. 

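For a highway network, we additionally define two non-linear
transforms $T(\mathbf{x}, \mathbf{W_T})$ and
$C(\mathbf{x}, \mathbf{W_C})$ such that

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}) \cdot T(\mathbf{x}, \mathbf{W_T}) + \mathbf{x} \cdot C(\mathbf{x}, \mathbf{W_C}). \tag{2}$$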

We refer to $T$ as the transform gate and $C$ as the carry gate, since
they express how much of the output is produced by transforming the
input and carrying it, respectively. For simplicity, in this paper we
set $C = 1 - T$, giving

$$\mathbf{y} = H(\mathbf{x}, \mathbf{W_H}) \cdot T(\mathbf{x}, \mathbf{W_T}) + \mathbf{x} \cdot (1 - T(\mathbf{x}, \mathbf{W_T})). \tag{3}$$

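The dimensionality of $\mathbf{x}$, $\mathbf{y}$,
$H(\mathbf{x}, \mathbf{W_H})$ and $T(\mathbf{x}, \mathbf{W_T})$ must
be the same for Equation (3) to be valid. Note that this
re-parametrization of the layer transformation is much more flexible
than Equation (1). In particular, observe that

$$\mathbf{y} = \begin{cases} \mathbf{x}, & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{0}, \\ H(\mathbf{x}, \mathbf{W_H}), & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{1}. \end{cases} \tag{4}$$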


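Similarly, for the Jacobian of the layer transform,

$$\frac{d\mathbf{y}}{d\mathbf{x}} = \begin{cases} \mathbf{I}, & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{0}, \\ H'(\mathbf{x}, \mathbf{W_H}), & \text{if } T(\mathbf{x}, \mathbf{W_T}) = \mathbf{1}. \end{cases} \tag{5}$$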
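
The two limiting cases of the Jacobian can be checked against the full derivative. The following is a sketch under stated assumptions: the products in Equation (3) are taken elementwise, and the transform gate is the sigmoid gate $T(\mathbf{x}) = \sigma(\mathbf{W_T}^{T}\mathbf{x} + \mathbf{b_T})$ used in Section 2.2. The $\operatorname{diag}(\cdot)$ notation and the symbol $\odot$ for the elementwise product are introduced here only for illustration.

$$
\frac{d\mathbf{y}}{d\mathbf{x}}
= \operatorname{diag}(T)\,\frac{\partial H}{\partial \mathbf{x}}
+ \operatorname{diag}(\mathbf{1} - T)
+ \operatorname{diag}(H - \mathbf{x})\,\frac{\partial T}{\partial \mathbf{x}},
\qquad
\frac{\partial T}{\partial \mathbf{x}}
= \operatorname{diag}\big(T \odot (\mathbf{1} - T)\big)\,\mathbf{W_T}^{T}.
$$

Because the last term carries the factor $T \odot (\mathbf{1} - T)$, it vanishes at both $T = \mathbf{0}$ and $T = \mathbf{1}$, leaving $\mathbf{I}$ and $\partial H / \partial \mathbf{x}$ respectively, in agreement with the two cases stated for the Jacobian.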

Thus, depending on the output of the transform gates, a highway layer
can smoothly vary its behavior between that of a plain layer and that
of a layer which simply passes its inputs through. Just as a plain
layer consists of multiple computing units such that the $i$-th unit
computes $y_i = H_i(\mathbf{x})$, a highway network consists of
multiple blocks such that the $i$-th block computes a block state
$H_i(\mathbf{x})$ and transform gate output $T_i(\mathbf{x})$.
Finally, it produces the block output
$y_i = H_i(\mathbf{x}) \cdot T_i(\mathbf{x}) + x_i \cdot (1 - T_i(\mathbf{x}))$,
which is connected to the next layer.
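
As a minimal sketch of Equation (3), the following pure-Python code computes one fully connected highway layer and checks its two limiting behaviors. The function names (`affine`, `highway_layer`), the choice of tanh for $H$, the explicit bias vectors and the toy weights are illustrative assumptions, not the implementation used in the experiments.

```python
import math


def sigmoid(z):
    """Logistic function, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, b):
    """Return W^T x + b, with W stored as rows indexed by input unit."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def highway_layer(x, W_H, b_H, W_T, b_T, activation=math.tanh):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x)), elementwise."""
    h = [activation(a) for a in affine(W_H, x, b_H)]  # block state H(x)
    t = [sigmoid(a) for a in affine(W_T, x, b_T)]     # transform gate T(x)
    return [h_i * t_i + x_i * (1.0 - t_i) for h_i, t_i, x_i in zip(h, t, x)]


# Hypothetical 3-unit example. Gate weights are zero, so T(x) = sigmoid(b_T).
x = [0.5, -1.0, 2.0]
W_H = [[0.1, 0.0, -0.2], [0.3, 0.1, 0.0], [0.0, -0.1, 0.2]]
b_H = [0.0, 0.0, 0.0]
W_T = [[0.0] * 3 for _ in range(3)]

y_carry = highway_layer(x, W_H, b_H, W_T, b_T=[-30.0] * 3)     # T close to 0
y_transform = highway_layer(x, W_H, b_H, W_T, b_T=[30.0] * 3)  # T close to 1
y_plain = [math.tanh(a) for a in affine(W_H, x, b_H)]          # plain layer
```

With a strongly negative gate bias the output is numerically indistinguishable from the input, and with a strongly positive one it matches the plain layer, mirroring the two cases of Equation (4).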

2.1. Constructing Highway Networks 

As mentioned earlier, Equation (3) requires that the dimensionality of
$\mathbf{x}$, $\mathbf{y}$, $H(\mathbf{x}, \mathbf{W_H})$ and
$T(\mathbf{x}, \mathbf{W_T})$ be the same. In cases when it is
desirable to change the size of the representation, one can replace
$\mathbf{x}$ with $\hat{\mathbf{x}}$ obtained by suitably sub-sampling
or zero-padding $\mathbf{x}$. Another alternative is to use a plain
layer (without highways) to change dimensionality and then continue
with stacking highway layers. This is the alternative we use in this
study.
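
The first alternative (sub-sampling or zero-padding the input) can be illustrated as follows. This is only a sketch: the helper name `resize_input` and the strided sub-sampling rule are assumptions made for illustration, since the text does not specify how the sub-sampling is performed, and the experiments in this study use the plain-layer alternative instead.

```python
def resize_input(x, size):
    """Return a copy of x with length `size` (zero-padded or sub-sampled)."""
    if size >= len(x):
        # Zero-padding: append zeros until the target size is reached.
        return list(x) + [0.0] * (size - len(x))
    # Sub-sampling (assumed rule): keep evenly spaced entries of x.
    stride = len(x) / size
    return [x[int(i * stride)] for i in range(size)]


x = [1.0, 2.0, 3.0, 4.0]
x_padded = resize_input(x, 6)      # [1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
x_subsampled = resize_input(x, 2)  # [1.0, 3.0]
```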

Convolutional highway layers are constructed similarly to fully
connected layers. Weight-sharing and local receptive fields are
utilized for both the $H$ and $T$ transforms. We use zero-padding to
ensure that the block state and transform gate feature maps are the
same size as the input.

2.2. Training Deep Highway Networks 

For plain deep networks, training with SGD stalls at the 
beginning unless a specific weight initialization scheme is 
used such that the variance of the signals during forward 
and backward propagation is preserved initially (Glorot & 
Bengio, 2010; He et al., 2015). This initialization depends 
on the exact functional form of H. 

For highway layers, we use the transform gate defined as
$T(\mathbf{x}) = \sigma(\mathbf{W_T}^{T}\mathbf{x} + \mathbf{b_T})$,
where $\mathbf{W_T}$ is the weight matrix and $\mathbf{b_T}$ the bias
vector for the transform gates. This suggests a simple initialization
scheme which is independent of the nature of $H$: $\mathbf{b_T}$ can
be initialized with a negative value (e.g. -1, -3, etc.) such that the
network is initially biased towards carry behavior. This scheme is
strongly inspired by the proposal of Gers et al. (1999) to initially
bias the gates in a Long Short-Term Memory recurrent network to help
bridge long-term temporal dependencies early in learning. Note that
$\sigma(x) \in (0, 1), \forall x \in \mathbb{R}$, so the conditions in
Equation (4) can never be exactly true.
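
To make the effect of a negative gate bias concrete, the sketch below evaluates the sigmoid at the two example biases. It assumes that the weighted input to the gate is close to zero at initialization (plausible for small zero-mean weights, but an assumption of this example), so that the initial gate output is approximately $\sigma(b_T)$. The variable name `initial_gate` is hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Approximate initial transform gate output for each example bias,
# assuming the weighted input W_T^T x is near zero at initialization.
initial_gate = {b_T: sigmoid(b_T) for b_T in (-1.0, -3.0)}

for b_T, t in initial_gate.items():
    print(f"b_T = {b_T:+.0f}: transform ~ {t:.3f}, carry ~ {1.0 - t:.3f}")
```

Under this assumption, a bias of -1 opens the transform gate to roughly 0.27 and a bias of -3 to roughly 0.05, so most of each block's output is initially carried over from its input.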

In our experiments, we found that a negative bias initialization was
sufficient for learning to proceed in very deep networks for various
zero-mean initial distributions of $\mathbf{W_H}$ and different
activation functions used by $H$. This is a significant property,
since in general it may not be possible to find effective
initialization schemes for many choices of $H$.

3. Experiments 

3.1. Optimization 

Very deep plain networks become difficult to optimize even when using
the variance-preserving initialization scheme from He et al. (2015).
To show that highway networks do not suffer from depth in the same
way, we run a series of experiments on the MNIST digit classification
dataset. We measure the cross entropy error on the training set in
order to investigate optimization without conflating it with
generalization issues.

We train both plain networks and highway networks with the same
architecture and varying depth. The first layer is always a regular
fully-connected layer, followed by 9, 19, 49, or 99 fully-connected
plain or highway layers and a single softmax output layer. The number
of units in each layer is kept constant: 50 for highway networks and
71 for plain networks. That way the number of parameters is roughly
the same for both. To make the comparison fair, we run a random search
of 40 runs for both plain and highway networks to find good settings
for the hyperparameters. We optimized the initial learning rate,
momentum, learning rate decay rate, activation function for $H$
(either ReLU or tanh) and, for highway networks, the value for the
transform gate bias (between -1 and -10). All other weights were
initialized following the scheme introduced by He et al. (2015).
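
The choice of 50 units for highway layers and 71 units for plain layers can be checked with a short count. The sketch below assumes that every transform has a full weight matrix plus a bias vector, and that a highway layer holds two such transforms ($H$ and $T$); the helper name `dense_params` is hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected transform."""
    return n_in * n_out + n_out


plain_per_layer = dense_params(71, 71)        # one transform: H
highway_per_layer = 2 * dense_params(50, 50)  # two transforms: H and T

print(plain_per_layer, highway_per_layer)  # 5112 5100
```

Under these assumptions the two layer types differ by only 12 parameters per hidden layer, which is consistent with the statement that the parameter counts are roughly the same.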

The convergence plots for the best performing networks for each depth
can be seen in Figure 1. While 10-layer plain networks show very good
performance, their performance degrades significantly as depth
increases. Highway networks, on the other hand, do not seem to suffer
from an increase in depth at all. The final result of the 100-layer
highway network is about one order of magnitude better than that of
the 10-layer one, and is on par with the 10-layer plain network. In
fact, we started training a similar 900-layer highway network on
CIFAR-100, which is only at 80 epochs as of now but so far has shown
no signs of optimization difficulties. It is also worth pointing out
that the highway networks always converge significantly faster than
the plain ones.

3.2. Comparison to Fitnets 

Deep highway networks are easy to optimize, but are they also
beneficial for supervised learning where we are interested in
generalization performance on a test set? To address this question, we
compared highway networks to the thin and deep architectures termed
Fitnets proposed recently by Romero et al. (2014) on the CIFAR-10
dataset augmented with random translations. Results are summarized in
Table 1.

Romero et al. (2014) reported that training using plain
backpropagation was only possible for maxout networks with depth up to
5 layers when the number of parameters was limited to ~250K and the
number of multiplications to ~30M. Training of deeper networks was
only possible through the use of a two-stage training procedure and
the addition of soft targets produced from a pre-trained shallow
teacher network (hint-based training). Similarly, it was only possible
to train 19-layer networks with a budget of 2.5M parameters using
hint-based training.

We found that it was easy to train highway networks with a number of
parameters and operations comparable to Fitnets directly using
backpropagation. As shown in Table 1, Highway 1 and Highway 2, which
are based on the architecture of Fitnet 1 and Fitnet 4 respectively,
obtain similar or higher accuracy on the test set. We were also able
to train thinner and deeper networks: a 19-layer highway network with
~1.4M parameters and a 32-layer highway network with ~1.25M parameters
both perform similarly to the teacher network of Romero et al. (2014).

4. Analysis 

In Figure 2 we show some inspections of the inner workings of the
best¹ 50-hidden-layer fully-connected highway networks trained on
MNIST (top row) and CIFAR-100 (bottom row). The first three columns
show, for each transform gate, the bias, the mean activity over 10K
random samples, and the activity for a single random sample
respectively. The block outputs for the same single sample are
displayed in the last column.

The transform gate biases of the two networks were initialized to -2
and -4 respectively. It is interesting to note that, contrary to our
expectations, most biases actually decreased further during training.
For the CIFAR-100 network the biases increase with depth, forming a
gradient. Curiously, this gradient is inversely correlated with the
average activity of the transform gates, as seen in the second column.
This indicates that the strong negative biases at low depths are not
used to shut down the gates, but to make them more selective. This
behavior is also suggested by the fact that the transform gate
activity for a single example (column 3) is very sparse. This effect
is more pronounced for the CIFAR-100 network, but can also be observed
to a lesser extent in the MNIST network.

¹ Obtained via random search over hyperparameters to minimize the best
training set error achieved using each configuration.

Figure 1. Comparison of optimization of plain networks and highway networks of various depths. All networks were optimized using
SGD with momentum. The curves shown are for the best hyperparameter settings obtained for each configuration using a random
search. Plain networks become much harder to optimize with increasing depth, while highway networks with up to 100 layers can still
be optimized well.


Network                Number of Layers   Number of Parameters   Accuracy

Fitnet results reported by Romero et al. (2014)
Teacher                5                  ~9M                    90.18%
Fitnet 1               11                 ~250K                  89.01%
Fitnet 2               11                 ~862K                  91.06%
Fitnet 3               13                 ~1.6M                  91.10%
Fitnet 4               19                 ~2.5M                  91.61%

Highway networks
Highway 1 (Fitnet 1)   11                 ~236K                  89.18%
Highway 2 (Fitnet 4)   19                 ~2.3M                  92.24%
Highway 3*             19                 ~1.4M                  90.68%
Highway 4*             32                 ~1.25M                 90.34%


Table 1. CIFAR-10 test set accuracy of convolutional highway networks with rectified linear activation and sigmoid gates. For
comparison, results reported by Romero et al. (2014) using maxout networks are also shown. Fitnets were trained using a two-step
training procedure using soft targets from the trained Teacher network, which was trained using backpropagation. We trained all
highway networks directly using backpropagation. * indicates networks which were trained only on a set of 40K out of 50K examples
in the training set.


The last column of Figure 2 displays the block outputs and clearly
visualizes the concept of "information highways". Most of the outputs
stay constant over many layers, forming a pattern of stripes. Most of
the change in outputs happens in the early layers (≈ 10 for MNIST and
≈ 30 for CIFAR-100). We hypothesize that this difference is due to the
higher complexity of the CIFAR-100 dataset.

In summary, it is clear that highway networks actually utilize the
gating mechanism to pass information almost unchanged through many
layers. This mechanism serves not just as a means for easier training,
but is also heavily used to route information in a trained network. We
observe very selective activity of the transform gates, varying
strongly in reaction to the current input patterns.


5. Conclusion 

Learning to route information through neural networks has helped to
scale up their application to challenging problems by improving credit
assignment and making training easier (Srivastava et al., 2015). Even
so, training very deep networks has remained difficult, especially
without considerably increasing total network size.

Highway networks are novel neural network architectures which enable
the training of extremely deep networks using simple SGD. While the
traditional plain neural architectures become increasingly difficult
to train with increasing network depth (even with variance-preserving
initialization), our experiments show that optimization of highway
networks is not hampered even as network depth increases to a hundred
layers.

The ability to train extremely deep networks opens up the possibility
of studying the impact of depth on complex problems without
restrictions. Various activation functions which may be more suitable
for particular problems, but for which robust initialization schemes
are unavailable, can be used in deep highway networks. Future work
will also attempt to improve the understanding of learning in highway
networks.

Figure 2. Visualization of certain internals of the blocks in the best 50 hidden layer highway networks trained on MNIST (top row) and
CIFAR-100 (bottom row). The first hidden layer is a plain layer which changes the dimensionality of the representation to 50. Each of
the 49 highway layers (y-axis) consists of 50 blocks (x-axis). The first column shows the transform gate biases, which were initialized
to -2 and -4 respectively. In the second column the mean output of the transform gate over 10,000 training examples is depicted. The
third and fourth columns show the output of the transform gates and the block outputs for a single random training sample.